Comparing prediction outputs

In this notebook, we introduce the idea of comparing the outputs of predictions to what really happened: that is, "scoring" predictions.

  • At this stage, we shall not look at producing hot-spots from a prediction, nor at ways to compare such hot-spots.

What are "predictions"?

We have implemented a variety of prediction algorithms. Some, for example the SEPP methods, have an explicit statistical model of a random process, attempt to fit that model to the data, and then issue a prediction based on the fit. Others, for example the classic Retrospective or Prospective hot-spotting techniques, produce a "ranking" of the different grid cells, but make no real attempt to give meaning to the actual "risk" values produced.

A prediction might also be made on a network but, topological issues aside, we can think of this as simply a prediction with "grid cell" replaced by "edge of a network".

In principle, many prediction methods produce an (at least piecewise) continuous function which is then integrated to produce a grid prediction. At present, we have no way to use a continuous prediction, and so we shall not develop a way to compare such predictions.

Summary:

  • A prediction should assign a relative risk to each grid cell, in such a way that sorting the cells by this value orders them from most risky to least risky.
  • A prediction may assign a genuine probability distribution to the grid, but we should not assume this.
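
For concreteness, here is a minimal numpy sketch of this view of a prediction: an array of relative risk values, with the cells ranked from most to least risky. The array and its shape are purely illustrative, not the library's own prediction classes.

    import numpy as np

    # A hypothetical 5 x 5 grid of relative "risk" values; only the ordering
    # of the cells is assumed to be meaningful, not the values themselves.
    risk = np.random.random((5, 5))

    # Rank the cells from most risky to least risky.
    order = np.argsort(risk, axis=None)[::-1]
    rows, cols = np.unravel_index(order, risk.shape)
    print(list(zip(rows[:5], cols[:5])))  # the five most risky cells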

What is "scoring"?

Simply put, some way of comparing a prediction to reality.

None of the prediction methods make any claim to predict that "a crime will definitely occur here" or that "a crime will not occur here". It is hence best to think of these predictions as probabilistic forecasts [2] (with the caveat above that some predictions cannot be expected to produce true probability distributions). In Meteorology, in classification in machine learning, and so on, it is common to try to "predict" discrete events. There is a large literature here (e.g. the extremely readable [1]), but it does not seem directly relevant to our work. In particular, our meaning of "hit rate" is distinct from that used when comparing classification results.

Probabilistic forecasting of spatial data is considered in the field of Meteorology; see for example http://www.cawcr.gov.au/projects/verification/. As that site says:

In general, it is difficult to verify a single probabilistic forecast.

Instead, we compare a series of predictions against the series of actual events we are trying to predict.

The techniques

In other notebooks we will study in detail:

Hit rate with coverage

Pick a coverage level, say 10%; flag the 10% of grid cells with the highest predicted risk; then calculate the percentage of actual events captured by those cells.
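
A minimal sketch of this calculation, assuming the prediction is a numpy array of relative risks and the events are given by grid-cell indices (the function name and data layout are illustrative, not the library's API):

    import numpy as np

    def hit_rate(risk, event_rows, event_cols, coverage=0.1):
        """Fraction of events falling in the top `coverage` fraction of
        grid cells, ranked by predicted risk."""
        n_selected = int(np.ceil(coverage * risk.size))
        top = np.argsort(risk, axis=None)[::-1][:n_selected]
        selected = np.zeros(risk.size, dtype=bool)
        selected[top] = True
        event_flat = np.ravel_multi_index((event_rows, event_cols), risk.shape)
        return selected[event_flat].mean()

    # Hypothetical data: a 10 x 10 risk grid and 20 events given by cell indices.
    risk = np.random.random((10, 10))
    event_rows = np.random.randint(0, 10, size=20)
    event_cols = np.random.randint(0, 10, size=20)
    print(hit_rate(risk, event_rows, event_cols, coverage=0.1))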

Rank ordering

Convert the estimated probability density to percentiles (order from least to most risky, with most risky being set to 1.0 and least risky being set to 0.0). This is obviously related to picking a coverage level. Then evaluate the "percentile score" at each event location to obtain a series. Compute a summary statistic (e.g. mean) or compare this series against the series from another prediction.
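
A sketch of the percentile calculation under the same illustrative assumptions (a numpy array of risks, events given as cell indices):

    import numpy as np

    def percentile_score(risk, event_rows, event_cols):
        """Percentile of each event's cell: the least risky cell scores 0.0
        and the most risky cell scores 1.0."""
        flat = risk.ravel()
        ranks = np.argsort(np.argsort(flat)) / (flat.size - 1)
        return ranks.reshape(risk.shape)[event_rows, event_cols]

    # Hypothetical data, as before.
    risk = np.random.random((10, 10))
    event_rows = np.random.randint(0, 10, size=20)
    event_cols = np.random.randint(0, 10, size=20)
    scores = percentile_score(risk, event_rows, event_cols)
    print(scores.mean())  # one possible summary statistic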

Likelihood

Compute the normalised likelihood of the actual events under the probability distribution given by the prediction.
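
One reading of "normalised likelihood" is the average log-likelihood per event, with the prediction normalised to a probability distribution over cells. The sketch below assumes that reading; the exact normalisation developed later may differ.

    import numpy as np

    def mean_log_likelihood(risk, event_rows, event_cols):
        """Average log-likelihood of the events, treating the normalised
        risk values as a probability distribution over the grid cells."""
        prob = risk / risk.sum()
        return np.mean(np.log(prob[event_rows, event_cols]))

    risk = np.random.random((10, 10))
    event_rows = np.random.randint(0, 10, size=20)
    event_cols = np.random.randint(0, 10, size=20)
    print(mean_log_likelihood(risk, event_rows, event_cols))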

Kernel density estimation

Perform KDE on the actual events to generate a probability distribution, and then compare this to the prediction.
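
A sketch using scipy's Gaussian KDE, evaluated at the cell centres and compared with the prediction. The comparison used here (correlation of the two normalised surfaces) is only an illustrative choice; other measures could equally be used.

    import numpy as np
    import scipy.stats

    def kde_similarity(risk, event_x, event_y, cell_size=1.0):
        """Fit a Gaussian KDE to the event coordinates, evaluate it at the
        cell centres, and compare with the normalised prediction via the
        correlation of the two surfaces."""
        kernel = scipy.stats.gaussian_kde(np.vstack([event_x, event_y]))
        rows, cols = np.indices(risk.shape)
        centres = np.vstack([(cols.ravel() + 0.5) * cell_size,
                             (rows.ravel() + 0.5) * cell_size])
        kde_surface = kernel(centres)
        pred = risk.ravel() / risk.sum()
        kde_surface = kde_surface / kde_surface.sum()
        return np.corrcoef(pred, kde_surface)[0, 1]

    # Hypothetical events in the same coordinate system as a 10 x 10 grid.
    risk = np.random.random((10, 10))
    event_x = np.random.random(30) * 10
    event_y = np.random.random(30) * 10
    print(kde_similarity(risk, event_x, event_y))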

Scoring rules

Adapt the Brier score, and other scoring rules, to our setting.
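
One possible adaptation of the Brier score to a grid: compare the predicted cell probabilities with an indicator of whether an event actually occurred in each cell. A sketch, under the same illustrative data layout as above:

    import numpy as np

    def brier_score(risk, event_rows, event_cols):
        """Mean squared difference between the predicted cell probabilities
        and an indicator of whether an event occurred in each cell."""
        prob = risk / risk.sum()
        occurred = np.zeros_like(prob)
        occurred[event_rows, event_cols] = 1.0
        return np.mean((prob - occurred) ** 2)

    risk = np.random.random((10, 10))
    event_rows = np.random.randint(0, 10, size=20)
    event_cols = np.random.randint(0, 10, size=20)
    print(brier_score(risk, event_rows, event_cols))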

Bayesian techniques

Treat the prediction as a Bayesian prior, update it with the actual events, and use ideas from information theory to measure how closely the resulting posterior matches the prior.
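
As a hypothetical illustration, one could take the (normalised) prediction as the prior, form a posterior by updating on the observed events, and measure the Kullback-Leibler divergence between the two. The way the "posterior" is formed below (a simple mixture with the empirical event distribution) is a placeholder, not the method developed later:

    import numpy as np

    def kl_divergence(posterior, prior):
        """Kullback-Leibler divergence D(posterior || prior) between two
        discrete distributions over the grid cells."""
        mask = posterior > 0
        return np.sum(posterior[mask] * np.log(posterior[mask] / prior[mask]))

    # Hypothetical example: the normalised prediction as prior, a crude
    # "posterior" built by mixing in the empirical event distribution.
    risk = np.random.random((10, 10))
    prior = (risk / risk.sum()).ravel()
    counts = np.random.poisson(0.2, size=prior.shape).astype(float)
    empirical = counts / max(counts.sum(), 1.0)
    posterior = 0.5 * prior + 0.5 * empirical
    print(kl_divergence(posterior, prior))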

References

  1. Stephenson, "Use of the 'Odds Ratio' for Diagnosing Forecast Skill", Wea. Forecasting (2000), 15, 221–232, https://doi.org/10.1175/1520-0434(2000)015<0221:UOTORF>2.0.CO;2
  2. "Forecast verification : a practitioner's guide in atmospheric science" edited by Ian T. Jolliffe and David B. Stephenson. (Chichester : J. Wiley, 2003) Amazon link
